Vulkan Optimizations and Fixes #8959
…ng and more
* fixed default sampling queue to include p_step
* changed sampling queue display to better reflect the actual logic
* added VK-specific settings `use_mmap_vk`, `flash_attn_vk`, `no_kv_offload_vk`
* added new presets for testing
I missed a validation issue in #8943, but the fix is now in this branch. I think this should be ready for a review and then merge.
sz * FLOAT_TYPE((data_a[ib0 + i].scales[v_im + 4] & 0x0f) | ((data_a[ib0 + i].scales[v_im] & 0xc0) >> 2)) + sw * FLOAT_TYPE((data_a[ib0 + i].scales[v_im + 5] & 0x0f) | ((data_a[ib0 + i].scales[v_im + 1] & 0xc0) >> 2))) - dmin * smin);
const uint tmp_idx = 16 * ix + tid;
tmp[tmp_idx] = fma(dall, (fma(sx, FLOAT_TYPE(data_a[ib0 + i].scales[v_im] & 0x3f), fma(sy, FLOAT_TYPE(data_a[ib0 + i].scales[v_im + 1] & 0x3f),
fma(sz, FLOAT_TYPE((data_a[ib0 + i].scales[v_im + 4] & 0x0f) | ((data_a[ib0 + i].scales[v_im] & 0xc0) >> 2)), sw * FLOAT_TYPE((data_a[ib0 + i].scales[v_im + 5] & 0x0f) | ((data_a[ib0 + i].scales[v_im + 1] & 0xc0) >> 2)))))), fma(-dmin, smin, tmp[tmp_idx]));
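// i.e. tmp[tmp_idx] = dall * (sx*s0 + sy*s1 + sz*s2 + sw*s3) - dmin * smin + tmp[tmp_idx],
// where s0..s3 stand for the 6-bit scales unpacked from data_a[ib0 + i].scales (the names s0..s3 are illustrative, not from the shader).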
If you consider only the FMA changes, is there a measurable performance gain?
It's very hard to tell. The GLSL compiler should be using FMA instructions anyway; this change just makes that certain instead of leaving it to the optimizer. Hopefully it means a few more FMA calls in the SPIR-V output, which could be checked.
But afterwards the SPIR-V code gets compiled again into a device-specific, driver-internal representation, where more optimization takes place. Since there are many combinations of devices and drivers, I can't really be sure whether this helped anywhere, but at least I'm sure it doesn't cause slowdowns. I haven't seen a significant performance difference on my devices.
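To illustrate the pattern (a hedged sketch with made-up names, not an excerpt from this PR): in GLSL, `a * b + c` may or may not be fused by the compiler, while an explicit `fma()` lowers directly to the `Fma` extended instruction in SPIR-V.

```glsl
// Before: the compiler may fuse these multiply-adds, but is not required to.
float sum_mul(float sx, float s0, float sy, float s1, float acc) {
    return sx * s0 + sy * s1 + acc;
}

// After: each fma() maps to the GLSL.std.450 Fma extended instruction in
// SPIR-V, so the fusion no longer depends on the GLSL compiler's optimizer.
float sum_fma(float sx, float s0, float sy, float s1, float acc) {
    return fma(sx, s0, fma(sy, s1, acc));
}
```

Whether the FMAs survive into SPIR-V can be spot-checked by compiling the shader (e.g. with `glslangValidator -V`) and disassembling it with `spirv-dis` from SPIRV-Tools, then searching for `Fma`.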
@ggerganov @slaren Can one of you review the non-Vulkan parts of this PR and approve if that's fine?
Make sure to fix the CI before merging
Vulkan Optimizations and Fixes (ggerganov#8959)
* Optimize Vulkan REPEAT performance
* Use Vulkan GLSL fused multiply-add instruction where possible
* Add GGML_VULKAN_PERF option to output performance data per operator
* Rework and fix Vulkan descriptor set and descriptor pool handling
* Fix float32 concat f16 shader validation error
* Add Vulkan GROUP_NORM eps parameter
* Fix validation error with transfer queue memory barrier flags
* Remove trailing whitespaces
I have implemented a number of Vulkan optimizations and fixes, as listed in the merge commit message above.
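One of the listed changes adds an eps parameter to the Vulkan GROUP_NORM operator. As a rough sketch of where such a parameter enters the math (illustrative GLSL, not the PR's actual shader; the function and variable names are made up):

```glsl
// Normalize a single value given precomputed group statistics.
// Real shaders first compute mean/variance with a workgroup reduction.
float group_norm_one(float x, float mean, float variance, float eps) {
    // eps keeps the denominator finite when the group variance is near zero
    return (x - mean) * inversesqrt(variance + eps);
}
```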
I will keep this in draft while I check a few more things, but feel free to test and benchmark. Don't expect a huge difference.